Unsupervised Word Segmentation with Bi-directional Neural Language Model


Abstract

We propose an unsupervised word segmentation model in which, for each unlabelled sentence sample, the learning objective is to maximize the generation probability of the sentence given all of its possible segmentations. Such a generation probability can be factorized into the likelihood of each possible segment given its context, in a recursive way. To capture both long- and short-term dependencies, we use a bi-directional neural language model to better extract the features of a segment's context. Two decoding algorithms were also developed to combine the context features from both directions to generate the final segmentation at inference time, which helps reconcile word-boundary ambiguities. Experimental results show that our context-sensitive unsupervised segmentation model achieved state-of-the-art results under different evaluation settings on various datasets for Chinese, and a comparable result for Thai.
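The recursive factorization described in the abstract can be illustrated with a standard forward dynamic program over segment boundaries. The sketch below is not the authors' implementation: the function names, the `max_len` cap on segment length, and the toy scorer `toy_seg_log_prob` (a fixed lexicon that ignores context) are all assumptions made for illustration. In the paper's setting the scorer would be a neural language model conditioned on the segment's context.

```python
import math

NEG_INF = float("-inf")

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))

def sentence_log_prob(sentence, seg_log_prob, max_len=4):
    """Log-probability of `sentence`, summed over all segmentations.

    alpha[t] is the log-probability of the first t characters; it is built
    recursively from alpha[t - k] plus the score of the last segment.
    """
    n = len(sentence)
    alpha = [NEG_INF] * (n + 1)
    alpha[0] = 0.0
    for t in range(1, n + 1):
        terms = []
        for k in range(1, min(max_len, t) + 1):
            context, segment = sentence[:t - k], sentence[t - k:t]
            terms.append(alpha[t - k] + seg_log_prob(segment, context))
        alpha[t] = log_sum_exp(terms)
    return alpha[n]

def best_segmentation(sentence, seg_log_prob, max_len=4):
    """Single highest-scoring segmentation (Viterbi-style max instead of sum)."""
    n = len(sentence)
    best = [NEG_INF] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for t in range(1, n + 1):
        for k in range(1, min(max_len, t) + 1):
            score = best[t - k] + seg_log_prob(sentence[t - k:t], sentence[:t - k])
            if score > best[t]:
                best[t], back[t] = score, t - k
    segments, t = [], n
    while t > 0:  # follow back-pointers from the end of the sentence
        segments.append(sentence[back[t]:t])
        t = back[t]
    return segments[::-1]

# Hypothetical stand-in for a neural scorer: a tiny lexicon, context ignored.
TOY_LEXICON = {"ab": 0.5, "a": 0.2, "b": 0.2, "c": 0.1}

def toy_seg_log_prob(segment, context):
    p = TOY_LEXICON.get(segment)
    return math.log(p) if p is not None else NEG_INF

total = math.exp(sentence_log_prob("abc", toy_seg_log_prob))
segments = best_segmentation("abc", toy_seg_log_prob)
```

Training would maximize `sentence_log_prob` over unlabelled sentences, while `best_segmentation` corresponds to a single-direction decoder. How the paper's two decoding algorithms combine forward and backward scores is not specified in the abstract and is not modelled here.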


Similar resources

Bi-directional LSTM Recurrent Neural Network for Chinese Word Segmentation

Recurrent neural network (RNN) has been broadly applied to natural language processing (NLP) problems. This kind of neural network is designed for modeling sequential data and has been testified to be quite efficient in sequential tagging tasks. In this paper, we propose to use bi-directional RNN with long short-term memory (LSTM) units for Chinese word segmentation, which is a crucial preprocess ...


Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling

In this paper, we propose a new Bayesian model for fully unsupervised word segmentation and an efficient blocked Gibbs sampler combined with dynamic programming for inference. Our model is a nested hierarchical Pitman-Yor language model, where Pitman-Yor spelling model is embedded in the word model. We confirmed that it significantly outperforms previous reported results in both phonetic transc...


Bayesian Unsupervised Word Segmentation with Hierarchical Language Modeling

This paper proposes a novel unsupervised morphological analyzer of arbitrary language that does not need any supervised segmentation nor dictionary. Assuming a string as the output from a nonparametric Bayesian hierarchical n-gram language model of words and characters, “words” are iteratively estimated during inference by a combination of MCMC and an efficient dynamic programming. This model c...


Feature-based Neural Language Model and Chinese Word Segmentation

In this paper we introduce a feature-based neural language model, which is trained to estimate the probability of an element given its previous context features. In this way our feature-based language model can learn representation for more sophisticated features. We introduced the deep neural architecture into the Chinese Word Segmentation task. We got a significant improvement on segmenting p...


Unigram Language Model for Chinese Word Segmentation

This paper describes a Chinese word segmentation system based on unigram language model for resolving segmentation ambiguities. The system is augmented with a set of pre-processors and post-processors to extract new words in



Journal

Journal title: ACM Transactions on Asian and Low-Resource Language Information Processing

Year: 2022

ISSN: 2375-4699, 2375-4702

DOI: https://doi.org/10.1145/3529387